Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix private to public #582

Merged
merged 8 commits into from
Sep 23, 2022
Merged

Fix private to public #582

merged 8 commits into from
Sep 23, 2022

Conversation

severo
Copy link
Collaborator

@severo severo commented Sep 22, 2022

See #380 (comment):

private -> public does not generate a webhook, it seems like a bug on the Hub. (btw: turning a public repo to private generates a "update" webhook, conversely)

and

https://github.com/huggingface/moon-landing/issues/2362#issuecomment-1254774183

Meanwhile, in datasets-server, I will fix the issue by using the request (that generates the SplitsResponseNotFound error) to trigger a cache refresh if needed, so: not urgent

In this PR, if a response to /splits or /first-rows is NotFound but should have existed, ie. is a cache miss, we add a job to the queue, and return a NotReady error instead.

We now use the webhook only to get the name of the dataset to handle.
Then, if the dataset is supported (exists, is public, can be gated or
not), update the cache; otherwise, delete the cache.
When a request comes to splits/, if no cache entry exist for the
dataset, and if no job is pending either, we now check on the Hub
if the dataset exists and is supported (ie. not private).
If so, it means that we somehow missed a webhook. We add a job to
the queue, and we respond with a "NotReady" error.
Otherwise, we respond with "NotFound".

This change is a quick fix for #380 (comment)

Also: to be able to test, we changed the test library "responses"
to a proper HTTP server ("pytest-httpserver" library), which
gives us more flexibility to mock the Hub APIs (both the auth endpoint
and the API datasets endpoint) and test even if we use huggingface_hub
lib in the code.

Also: the tests are a bit more precise (more cases)
also: dataset_name -> dataset. also: fix style
If a request is made to /first-rows and there is no entry in the
cache, we check if it should.

Various cases have to be managed, which makes the logic a bit (too)
convoluted.

We return a FirstRowsNotReady error if:
- the /first-rows response for the split is already in process
- the /splits response for the dataset is still in process

We return a FirstRowsNotReady error, AND we add a /splits job because
something has failed before, if:
- the /splits response for the dataset exists, and the split is part
  of the splits of the dataset
- the /splits response for the dataset does not exist, is not in
  process, but it should because the dataset is supported.

Note that I didn't add a unit test for every of these cases
@severo severo marked this pull request as ready for review September 23, 2022 09:56
@severo severo merged commit b7a34ec into main Sep 23, 2022
@severo severo deleted the fix-private-to-public branch September 23, 2022 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant